A novel changepoint detection algorithm 



Allen B. Downey 



00 
O 
O 
(N 

o 

Q 



< 



> 

cn 

00 

o 

> 

X 
J3 



Abstract 

We propose an algorithm for simultaneously detect- 
ing and locating changepoints in a time series, and a 
framework for predicting the distribution of the next 
point in the series. The kernel of the algorithm is a 
system of equations that computes, for each index 
i, the probability that the last (most recent) change 
point occurred at i. We evaluate this algorithm by 
applying it to the change point detection problem 
and comparing it to the generalized likelihood ra- 
tio (GLR) algorithm. We find that our algorithm 
is as good as GLR, or better, over a wide range of 
scenarios, and that the advantage increases as the 
signal-to-noise ratio decreases. 

1 Introduction 

Predicting network performance characteristics is 
useful for a variety of applications. At the protocol 
level, predictions are used to set protocol parameters 
like timeouts and send rates [IT]. In middleware, 
they are used for resource selection and scheduling 
|14j . And at the application level, they are used for 
predicting transfer times and choosing among mir- 
ror sites. 

Usually these predictions are based on recent mea- 
surements, so their accuracy depends on the con- 
stancy of network performance. For some network 
characteristics, and some time scales, Internet mea- 
surements can be described with stationary models, 
and we expect simple prediction methods to work 
well. But many network measurements are char- 
acterized by periods of stationarity punctuated by 
abrupt transitions [15] [9]. On short time scales, 
these transitions are caused by variations in traffic 
(for example, the end of a high-throughput trans- 
fer). On longer time scales, they are caused by 
routing changes and hardware changes (for exam- 
ple, routers and links going down and coming back 
up). 



These observations lead us to explore statistical 
techniques for changepoint detection. Changepoint 
analysis is based on the assumption that the ob- 
served process is stationary during certain intervals, 
but that there are changepoints where the parame- 
ters of the process change abruptly. If we can detect 
changepoints, we know when to discard old data and 
estimate the new parameters. 

Thus, changepoint analysis can improve the ac- 
curacy of predictions. It can also provide meta- 
predictions; that is, it can be used to indicate when 
a prediction is likely to be accurate. 

Statisticians have done extensive work on the gen- 
eral problem of changepoint detection. Variations 
on the changepoint problem [2] include: 

Changepoint detection: The simplest version of 
the changepoint problem is testing the hypoth- 
esis that a changepoint has occurred. Elemen- 
tary algorithms for change detection include 
control charts, CUSUM algorithms and likeli- 
hood ratio tests [T]. 

Location estimation: For some applications it is 
important to estimate the location (time) of the 
changepoint. More generally, it might be useful 
to find a confidence set or an interval of times 
that contain a changepoint at some level of con- 
fidence. 

Tracking: The goal of the tracking problem is to 
partition a time series into stationary intervals 
and estimate the parameters of the process in 
each interval. A simple approach is to use hy- 
pothesis testing to detect a changepoint, esti- 
mate the location of the change(s), and then 
use conventional techniques to estimate the pa- 
rameters of the process in each interval. 

For each of these problems, there is an alternative 
Bayesian formulation. Instead of testing (and pos- 
sibly rejecting) the hypothesis that a changepoint 
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has occurred, the Baycsian approach is to compute 
the subjective probability of the hypothesis. Instead 
of estimating the location, the Bayesian approach 
is to compute P{i), the probability that there is a 
changepoint at index i. The Bayesian version of the 
tracking problem is to compute the probability of 
a changepoint at a given time and the conditional 
distribution of the parameters (conditioned on the 
location) . 

In this paper, we take the Bayesian approach and 
apply it to the tracking problem. Our algorithm can 
also be used for changepoint detection and location 
estimation. 

1.1 Online or off? 

Changepoint detection algorithms are sometimes 
classified as "online" or "offline", but this is not an 
intrinsic characteristic of the algorithm; it is a de- 
scription of the deployment. Online algorithms run 
concurrently with the process they are monitoring, 
processing each data point as it becomes available, 
with the real-time constraint that processing should 
complete before the next data point arrives. Offline 
algorithms consider the entire data set at once, and 
there is no real-time constraint on the run time. 

So in most cases, "online" means "fast enough to 
keep up" . Sometimes it means "incremental" , in the 
sense that when a new data point arrives, the algo- 
rithm can perform an update, based on previously 
stored results, faster than it could recompute from 
scratch. Incremental algorithms are particularly ap- 
pealing in embedded applications, where processing 
power is at a premium, but it is not generally nec- 
essary for online algorithms to be incremental. 

"Online" and "offline" are probably best used to 
describe the problem formulation rather than the 
algorithm. In the online scenario, the goal is to de- 
tect a changepoint as soon as possible after it oc- 
curs, while minimizing the rate of false alarms. In 
the offline scenario, the goal is often to identify all 
changepoints in a sequence, and the the criteria are 
sensitivity (detecting true changepoints) and speci- 
ficity (avoiding false positives). 

The algorithm we are proposing can be used for 
both online and offline problems. It is incremental, 
so it lends itself to online deployment; on the other 
hand, it requires computation time proportional to 
for each data point, where n is the number of 



points in the current window, so for some applica- 
tions it may not be fast enough to satisfy real time 
requirements. 

Nevertheless, since the applications we are con- 
sidering (maJiing real time predictions) are online, 
we evaluate our algorithm by the criteria of online 
problems. 

1.2 Prediction with changepoints 

The goal of this work is to generate a predictive dis- 
tribution from a time series with changepoints. For 
example, imagine that we have observed n values 
xi . . . Xn, and we want to predict the value of x^+i. 
In the absence of changepoints, we might model the 
series as a random variable X with a stationary dis- 
tribution (and, possibly, autocorrelation structure). 
We could use the observed values to estimate the 
parameters of X and then use those parameters to 
compute the distribution of the next value. 

If we know that there is a changepoint, and we 
assume that there is no relation between the pa- 
rameters of the process before and after the change, 
then the data before the changepoint are irrelevant 
for making predictions. 

But in the context of network measurements, we 
usually don't know when a changepoint occurs. We 
can only infer it from the time series itself, and we 
usually don't know with certainty whether it oc- 
curred or exactly when. 

There are two ways to handle this uncertainty. 
The most common approach is to compute from the 
data a functional that is likely to increase after a 
changepoint and to compare the resulting values to a 
previously-chosen threshold. If the value exceeds the 
threshold, we assume that there was a changepoint 
and discard older data. Otherwise we assume that 
there was no changepoint. 

The alternative, which we explore here, is to use 
changepoint probabilities to generate predictive dis- 
tributions. For example, if we think that there is 
p chance of a changepoint at i, then we can gener- 
ate two distributions, one that assumes that there 
was a changepoint, and one that assumes that there 
wasn't, and then mix the distributions using p as a 
weight. 

Therefore, our intermediate goal is to compute 
P{i~^), which we define as the probability that i 
is the index of the last (most recent) changepoint. 
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The + notation is intended to distinguish from P{i), 
which is the probabihty that i is the index of a 
changepoint (but not necessarily the last). 



1.3 Related Work 

In a seminal paper on the tracking problem, Cher- 
noff and Zacks present a Bayesian estimator for the 
current mean of a process with abrupt changes [1]. 
Like us, they start with an estimator that assumes 
there is at most one change, and then use it to gen- 
erate an approximate estimator in the general case. 
Their algorithm makes the additional assumption 
that the change size is distributed normally; our al- 
gorithm does not require this assumption. Also, our 
algorithm generates a predictive distribution for the 
next value in the series, rather than an estimate of 
the current mean. 

Since then, there has been extensive work on a va- 
riety of related problems, using both Bayesian and 
non-Bayesian approaches. More recent works in- 
clude extensions to deal with multiple changepoints, 
autoregressive processes, and long-range dependent 
processes. 

The algorithm we propose can be extended to de- 
tect changes in the variance as well as the mean of a 
process. This kind of changepoint has received rel- 
atively little attention; one exception is recent work 
by Jandhyala, Fotopoulos and Hawkins [8j. 

Most recently, Vellekoop and Clark propose a non- 
linear filtering approach to the changepoint detec- 
tion problem (but not estimation or tracking) [13j . 

We are aware of a few examples where these 
techniques have been applied to network measure- 
ments. Blazek et al explore the use of change-point 
algorithms to detect denial of service attacks 
Similarly Deshpande, Thottan and Sikdar use non- 
parametric CUSUM to detect BGP instabilities [6]. 

In the context of large databases, Kifer, Ben- 
David and Gehrke propose an algorithm for detect- 
ing changes in data streams [1^. It is based on a 
two-window paradigm, in which the distribution of 
values in the current window is compared to the dis- 
tribution of values in a past reference window. This 
approach is appropriate when the number of points 
between changepoints is large and alarm delay is not 
a critical metric. 



1.4 Outline 

Before presenting our algorithm we provide a sum- 
mary of the techniques and results we will use from 
Bayesian statistics (Section [2]). Then we develop 
our algorithm, starting with a direct method for the 
case where we know there is exactly one change- 
point, and then using that result to develop an it- 
erative method for the case where the number of 
changepoints is unknown (Section [3|) . That section 
includes synthetic examples to demonstrate changes 
in both mean and variance, and one real series, the 
ubiquitous Nile dataset. Finally, we apply our al- 
gorithm to the changepoint detection problem and 
compare it with GLR, a standard algorithm for this 
problem (Section S]). 

2 Bayesian Statistics 

This section presents an introduction to Bayesian 
statistics, and develops several examples we will use 
in later sections. 

The fundamental idea behind all Bayesian statis- 
tics is the diachronic interpretation of Bayes' Theo- 
rem: 

where H is a hypothesis and E is a relevant body of 
evidence. P{H\E) is the probability of the hypoth- 
esis given the evidence, which is called the posterior 
probability. P{H) is the probability of the hypoth- 
esis before we take account of the evidence, known 
as the prior probability. P{E\H) is the probability 
of seeing the evidence, given that the hypothesis is 
true; this term is sometimes called the likelihood. Fi- 
nally, P{E) is the probability of seeing the evidence 
under any possible hypothesis, called the normaliz- 
ing constant. 

If the hypotheses we are interested in are stochas- 
tic processes, it is usually straightforward to com- 
pute P{E\H). The prior probability P{H) is either 
based on a previous computation or reflects a sub- 
jective degree of belief in the hypothesis before see- 
ing the evidence. The normalizing constant is hard 
to formulate in general, but if we can identify a mu- 
tually exclusive and complete set of hypotheses, S, 
then 
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P(E) 



E 



P{E\Hi)P{Hi 



(2) 



In practice it is sufficient to compute any likelihood 
function, L{E\H), such that 



L{E\H) = kP{E\H)P{H) 



(3) 



where k is an arbitrary constant that drops out when 
we compute 



P{H\E) 



L{E\H) 



(4) 



As an example, suppose that H is the hypothesis 
that a random variable X is distributed N(fi,a'^) 
with known /i and a"^. In a continuous distribution, 
the probability of a given selection X = x is not well 
defined, but since the probability density function 
pdfx{x) is proportional to P{X = x\H), we can use 
it as a likelihood function: 



L{X = x\H)=pdfx{x) 
1 



exp 



1 / X — /X 

2 V a 



(5) 



crV2vr 

For a sample E = Xi with i = 1 . . .n 

L{E\H) = llpdfxixi) (6) 

2.1 Posterior distributions 

We have been using the notation P{A) to mean the 
probability of an event A. Following the notation of 
Gelman et al [7], we use p{x) as a shorthand for the 
probability density function pdfx{x). This notation 
makes it convenient to write Bayes' Theorem for dis- 
tributions. For example, if we observe a set of data, 
y, we might want to compute the distribution of 9, 
which is the set of parameters of the process that 
produced y. Generalizing Bayes' Theorem yields 



p{0\y) 



p{y\O)pi0) 
p{y) 



(7) 



where p{9) is the prior distribution of 9, and p{9\y) 
is the posterior distribution. 

As an example, suppose that y is a set of inter- 
arrival times between passengers at a bus stop, and 
the arrival process is Poisson; in this case 9 is the 



single parameter of the Poisson model (the mean 
arrival rate), and p{9\y) is the distribution of 9, tak- 
ing the observed data into account. In frequentist 
statistics, 9 is not a random variable, so it is not 
meaningful to talk about its distribution, but in the 
Bayesian interpretation, probability can refiect a de- 
gree of belief about a non-random quantity. 

As another example, which we will use in Sec- 
tion [3l suppose that we have a set of observations y 
that come from a normal distribution with unknown 
mean /i and variance o"^. We would like to compute 
posterior distributions for /x and a. For this problem 
it is common to use a noninformative prior distribu- 
tion that is uniform on (/x, log(o")), in which case 
the posterior distribution of cj^ is a scaled inverse 
chi-square distribution: 



(8) 



where n is the number of observations in y and 
is the estimated variance, 



n 



1 " 



y? 



(9) 



where y is the estimated mean ^ yi- 

Then for a given value of a the conditional poste- 
rior distribution of ^ is Gaussian: 



(10) 



For the derivation of this result, see Gelman et al 
[7], pages 74-76. 

3 Estimating P{i^) 

Imagine that we have observed a series of n values. 
We would like to estimate P(i^), the probability 
that i is the index of the last (most recent) change- 
point, for all i. 

We start with a simplified version of the problem 
and work up to the most general version. To get 
started, we assume 

1. Exactly one changepoint occurred during the 
observed interval, and it is equally likely to be 
anywhere in the interval. 

2. Between changepoints, the distribution of val- 
ues is normal with constant mean and variance. 
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3. Before the changepoint, the mean is known to 
be fiQ. After the changepoint the mean is un- 
known. 

4. The variance is known to be cr^ both before and 
after the changepoint. 

The first assumption is useful because it means 
that the series of hypotheses, i"*", is mutually ex- 
clusive and complete, which makes it practical to 
compute the normalizing constant in Equation [TJ 
The other assumptions make it easy to compute a 
likelihood function. 

In the following sections, we present an algorithm 
for this simplified problem and then gradually relax 
the assumptions. 

3.1 Exactly one changepoint 

We would like to compute P{E\i^), which is the 
probability of seeing the observed data E given the 
hypothesis i^. If we know there is only one change- 
point in an interval, i"*" is equivalent to the hypothe- 
sis that the only changepoint is at i, so the likelihood 
function is 

i n 

J{P{xj\iio,a^) J] P(x,|/zi,a2) (11) 
j=i i=«+i 

where /xi is the mean after the changepoint. Since 
/xi is unknown, it has to be estimated from the data 
Xj+i . . .Xn- Given the sample mean and variance, we 
can compute the posterior distribution p{^i) using 
Equation [lOl and then 

P{E\i+) = J P{E\i+,^lo,|^l,a^)p{^ll)d^Il (12) 

It is not practical to evaluate this integral, but since 
the algorithm we are proposing is iterative, we can 
approximate it by making a random selection from 
p(/ii) during each iteration. 

If the changepoint is equally likely to occur any- 
where, the prior probabilities are all equal, so they 
drop out of the computation. The set of hypotheses 
is exclusive and complete, so the normalizing con- 
stant is 

n 

P{E) = Y,P{E\^^) (13) 

4 = 1 



3.2 Zero or one changepoints 

If we know that there are zero or one changepoints 
in the interval, we can generalize this technique by 
adding Hq, which is the hypothesis that there are 
no changepoints. 

If we assume that the probability of a changepoint 
at any time is /, then the prior probabilitieqll are 

P(i+) = /(I -/)"-! 
P(i?o) = (!-/)" 

The likelihood of Hq is 

n 

P{E\Ho, 110,(1^) = J{P{x,\iio,(J^) (15) 
i=i 

Again, we have a set of hypotheses that is exclusive 
and complete, so the normalizing constant is 

n 

P{E) = P{E\Ho)P{Ho) + P{E\i+)P(i+) (16) 

i=l 

3.3 Iterative algorithm 

As the next step, we drop the assumption that we 
know that exactly one changepoint has occurred. In 
that case we can't evaluate the P{i~^) directly, but 
we can write the following system of equations for 
the P{i~^) and the P{i~^^), where is the hypoth- 
esis that the second-to-last changepoint is at i. This 
system is the kernel of the algorithm we are propos- 
ing. 

Pii+) = Pii+\H^) P{H^) + Y,P{i-^\j++) Pij++) 

j<i 

p{i++) = Y^p{i++\k+) p{k+) 

k>i 

(17) 

where 

• is the hypothesis that there are fewer than 
two changepoints during the observed interval 
(the subscript is meant to look like "zero or 
one"): 

P(i/0) = l-^P(j++) (18) 

^This is an improper prior that is proportional to the ac- 
tual probabihties. 
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• P{i~^\j~^~^) is the probability that i is the last 
changepoint, given that j is the second-to-last. 
Computing this probability reduces to the prob- 
lem in Section [m because if j is the second-to- 
last changepoint, we know that there is exactly 
one changepoint between j and n. 

• P{i~^\H0) is the probability that i is the last 
changepoint, given that there are fewer than 
two changepoints in the interval. This reduces 
to the problem in Section 13.21 where we know 
that there are zero or one changepoints. 

• P{i^^\k^) is the probability that i is the 
second-to-last changepoint given that k is the 
last. If fc is a changepoint, then all data after 
time k is irrelevant, so P(i~^^\k~^) = Pk{i~^), 
where P^ indicates the probability that was 
computed at time k. If we store previous re- 
sults, we don't have to compute this value; we 
can just look it up. 

At each time step, we have to recompute P{i~^\j^^) 
and P{i~^\H0). Then we estimate a solution to the 
system of equations for P{i'^) and P(i^+) by Ja- 
cobi's method. Since the result from the previous 
time step is a good starting approximation, one it- 
eration is usually sufficient. 

3.4 Time and space requirements 

After we have seen and processed n data points, we 
have accumulated the following data: 

• Pfc («"*") for all < n and i < k, which is roughly 

entries. 

• The partial sums 

k k 



J2 Xj and Yj • 



(19) 



j=i 



for all A; < n and i < k. In total we have to 
store roughly n? partial sums. 
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Figure 1: A time series with two changepoints 
(marked with gray lines), and estimated cumulative 
probabilities P{i^) and P(i++). 

• Compute the probabilities P„+i(i"'") and 



Pr 



^) for all i (Equation [T7I) . 



Updating the partial sums is linear in n, but com- 
puting the likelihoods and probabilities takes time 
proportional to v?. 

This complexity is a drawback of this algorithm 
in an application where the time between change- 
points is long. But if there are fewer than 1000 
timesteps between changepoints, this algorithm is 
feasible for online applications with modest real time 
constraints (< 10 timesteps per second). 



When we see the next data point, Xn+i, we have 3.5 An example 



to 



Update the partial sums (Equation I19p . 

Compute the likelihood functions (Equa- 
tion E]). 



Figure [T] shows a time series with changepoints at 
t = 50 and t = 100. At the first changepoint, ^ 
shifts from -0.5 to 0.5. At the second changepoint, 
it shifts to 0. Throughout, a is 1.0. 

The bottom graph shows the cumulative sums of 
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Figure 2: Average annual flow in the Nile river at 
Aswan from 1871 to 1970, and the estimated P{i~^) 
aXk = 33,66,99. 

P{i~^) and P{i'^^), estimated at t = 150. Appropri- 
ately, P{i'^) shows a large mode near t = 100, which 
actually is the most recent changepoint. There is a 
smaller mode near i = 50, the location of the second- 
most-recent changepoint, and another near t = 142, 
which is due to chance. 

The estimated probabilities are not normalized, 
so they don't always add up to one. In this case, 
the sum of P{i'^^) for all i is 0.96, which indicates a 
4% chance that we have not seen two changepoints 
yet. 

3.6 Generalization to unknown /iq, cr^ 

A feature of the proposed algorithm is that it ex- 
tends easily to handle the cases when //q or a^, 
or both, are unknown. There are two possible ap- 
proaches. 

The simpler option is to estimate //q and/or cr^ 
from the data and then plug the estimates into 
Equation [TTJ This approach works reasonably well. 



but because estimated parameters are taken as 
given, it tends to overestimate the probability of a 
changepoint. 

An alternative is to use the observed y and to 
compute the posterior distributions for ^0 and/or cr^ 
using Equations [8] and [TOl and then make a random 
choice from those distributions (just as we did for /ii 
in Section [3TT]) . 

As an example, we apply our algorithm to the 
ubiquitous Nile data, a standard example in change- 
point detection since Cobb's seminal paper in 1978 
[5]. This dataset records the average annual flow in 
the Nile river at Aswan from 1871 to 1970. 

Figure [2] shows this data along with the estimated 
probability P{i'^) computed at three different times 
(after 33, 66 and 99 years). The salient feature at 
all three times is a high probability of a changepoint 
near 1898, which is consistent with the conclusion of 
other authors working with this data. At k = 66, 
there is also a high probability of a changepoint near 
1920, but by A; = 99 that hypothesis has all but 
vanished. Similarly, at k = 99, there is a small 
probability that the last four data points, all on the 
low side, are evidence of a changepoint. 

As time progresses, this algorithm is often quick 
to identify a changepoint, but also quick to "change 
its mind" if additional data fail to support the first 
impression. Part of the reason for this mercurial 
behavior is that P{i'^) is the probability of being the 
last changepoint; so a recent, uncertain changepoint 
might have a higher probability than a more certain 
changepoint in the past. 

3.7 Detecting changes in variance 

To detect changes in variance, we can generalize the 
algorithm further by estimating a separately before 
and after the changepoint. In this case, the likeli- 
hood function is 



P{X\i^ 



lii,al) (20) 



j=i+i 



where /xq a-nd cjg are drawn from the posterior distri- 
bution for the mean and variance before the change- 
point, and likewise fii and a\ after the breakpoint. 

Figure [3] shows a time series with a change in mean 
followed by a change in variance. Before the first 
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Figure 3: A time series with a change in mean and 
a change in variance. 

changepoint, /i,cj = (1,1); after the first change- 
point, fi,a = (0,1); and after the second change- 
point, iJ.,a = (0, 0.5). 

Correctly, the estimated P{i^) indicate a high 
probabihty near the last changepoint. The esti- 
mated P{i~^^) indicate less certainty about the lo- 
cation of the second-to-last changepoint. 

For this example the generalized algorithm works 
well, but its behavior is erratic when a series of sim- 
ilar values appear in the time series. In that case 
the estimated values of af tend to be small, so the 
computed probability of a changepoint is unreason- 
ably high. We are considering ways to mitigate this 
effect without impairing the sensitivity of the algo- 
rithm too much. 

4 Evaluation 

The examples in the previous section show that the 
estimated values are reasonable, but not that they 
are accurate or more useful than results from other 
algorithms. Since there are no previous algorithms 



for computing P{i^), there is no direct basis for 
comparison, but there are two evaluations we could 
perform: 

• There are Bayesian methods that estimate 
P{i)', that is, the probability that there is a 
changepoint at any time i. From this we can 
compute P{i~^) and compare to our algorithm. 
We would like to pursue this approach in future 
work. 

• Changepoint detection algorithms are often 
used "online" to raise an alarm when the 
evidence for a changepoint reaches a certain 
threshold. These algorithms are relatively sim- 
ple to implement, and the criteria of merit are 
clear, so we start by evaluating our algorithm 
in this context. 

The goal of online detection is to raise an alarm as 
quickly as possible after a changepoint has occurred 
while minimizing the frequency of false alarms. 

Basseville and Nikiforov [T] propose a criterion for 
comparing algorithms in the case where an actual 
changepoint is equally likely to occur at any time, 
so the time of the first changepoint, to is distributed 
geometrically. 

During any given run, an algorithm has some 
probability of raising an alarm before to; this false 
alarm probability is denoted a. Then, for a constant 
value of a, the goal is to minimize the mean delay, 
T = ta — to + 1, where ta is the time of the alarm. 
In our experiements, we use a trimmed mean, for 
reasons explained below. 

4.1 GLR 

The gold standard algorithm for online changepoint 
detection is CUSUM (cumulative sum), which is op- 
timal, in the sense of minimizing mean delay, when 
the parameters of the distribution are known before 
and after the changepoint [Ij. 

When the parameters after the changepoint are 
not known, CUSUM can be generalized in several 
ways, including Weighted CUSUM and GLR (gen- 
eralize likelihood ratio). 

For the experiments in this section, we consider 
the case where the distribution of values before the 
changepoint is normal with known parameters ^q, 
cr^; after the changepoint, the distribution is normal 
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with unknown /xi and the same known variance. We 
compare the performance of our algorithm to GLR, 
specifically using the decision function in Equa- 
tion 2.4.40 from Basseville and Nikiforov jl]. 

The decision function g/. is constructed so that its 
expected value is zero before the changepoint and in- 
creasing afterwards. When gf^ exceeds the threshold 
h, an alarm is raised, indicating that a changepoint 
has occurred (but not when). The threshold h is 
a parameter of the algorithm that can be tuned to 
trade off between the probability of a false alarm 
and the mean delay. 

4.2 CPP 

It is straightforward to adapt our algorithm so that 
it is comparable with GLR. First, we adapt the like- 
lihood function to use the known values of fiQ and a, 
so only is unknown. Second, we use the decision 
function: 

k 

g, = Y^P(i+) (21) 

i=0 

This gk is the probability that a changepoint has 
occurred somewhere before time k. Our algorithm 
provides additional information about the location 
of the changepoint, but this reduction eliminates 
that information. 

We call this adapted algorithm CPP, for "change 
point probability", since the result is a probability 
(at least in the Bayesian sense) rather than a likeli- 
hood ratio. 

Both GLR and CPP have one major parameter, 
/i, which has a strong effect on performance, and a 
minor parameter that has little effect over a wide 
range. For GLR, the minor parameter is the min- 
imum change size, i^mm- For CPP, it is the prior 
probability of a changepoint, /. 

4.3 Results 

In our first experiment, /xq, ct = (0, 1) and /ii = 1, so 
the signal to noise ratio is S/N = — /io|/c = 1. In 
the next section we will look at the effect of varying 
S/N. 

In each trial, we choose the actual value of to from 
a geometric distribution with parameter p, so 

P(to = n) = (l-y9)"-V (22) 




Ol 
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Figure 4: Mean delay versus false alarm probability 
for a range of threshold values. 

We set p = 0.02, so the average value of to is 50. 

Each trial continues until an alarm is raised; the 
alarm time is denoted ta- If ta < tQ, it is considered 
a false alarm. Otherwise, we compute the delay time 

ta-to + 1. 

To compute mean delay, we average over the trials 
that did not raise a false alarm. We use a trimmed 
mean (discarding the largest and smallest 5%) for 
several reasons: (1) the trimmed mean is generally 
more robust, since it is insensitive to a small num- 
bers of outliers, (2) very short delay times are some- 
times "right for the wrong reason"; that is, the al- 
gorithm was about to generate a false positive when 
the changepoint occurred, and (3) when to is small, 
we see very few data points before the changepoint 
and ta can be arbitrarily large. In our trials, we quit 
when k > to + 100 and set ta = inf. The trimmed 
mean allows us to generate a meaningful average, 
as long as fewer than 5% of the trials go "out of 
bounds" . 

The performance of both algorithms depends on 
the choice of the threshold. To compare algorithms, 
we plot mean delay versus false alarm probability for 
a range of thresholds. Then for a fixed false alarm 
probability, a, we can see which algorithm yields a 
shorter mean delay. 

Figure U] shows the results averaged over 1000 
runs. Both algorithms use the same random number 
generator, so they see the same data. 

Across a wide range of a, the mean alarm time for 
CPP is slightly better than for GLR; the difference 
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Figure 5: Mean delay versus a with false alarm prob- 
ability a = 0.05. 

is 0.5-1.0 time steps. For example, at a = 0.05, 
the mean alarm time is 9.7 for CPP and 10.7 for 
GLR (estimated by linear interpolation between 
data points). 

This advantage increases as a increases (or equiv- 
alently as S/N decreases). Figure [5] shows the mean 
alarm time with a = 0.05 for a range of values 
of a. When a is small, the mean alarm time is 
small, and GLR has an advantage. But for a > 0.7 
{S/N < 1.4) CPP is consistently better. 

We conclude that CPP is as good as GLR, or 
better, over a wide range of scenarios, and that the 
advantage increases as the signal-to-noise ratio de- 
creases. 

5 Conclusions 

We have proposed a novel algorithm for estimating 
the location of the last changepoint in a time se- 
ries that includes abrupt changes. The algorithm is 
Bayesian in the sense that the result is a joint poste- 
rior distribution of the parameters of the process, as 
opposed to a hypothesis test or an estimated value. 
Like other Bayesian methods, it requires subjective 
prior probabilities, but if we assume that change- 
points are equally likely to occur at any time, the 
required prior is a single parameter, /, the proba- 
bility of a changepoint. 

This algorithm can be used for changepoint detec- 
tion, location estimation, or tracking. In this paper, 
we apply it to the changepoint detection problem 



and compare it with a standard algorithm for this 
problem, GLR. We find that the performance is as 
good as GLR, or better, over a wide range of sce- 
narios. 

In this paper, we focus on problems where val- 
ues between changepoints are known to be normally 
distributed. In some cases, data from other distri- 
butions can be transformed to normal. For exam- 
ple, several Internet characteristics are lognormal 
[12j . which means that they are normal under a log 
transform. Alternatively, it is easy to extend our 
algorithm to handle any distribution whose pdf can 
be computed, but there is likely to be a performance 
penalty (the normal assumption is convenient in our 
implementation because its sufficient statistics can 
be updated incrementally with only a few opera- 
tions). 

We think that the proposed algorithm is promis- 
ing, and makes a novel contribution to the statistics 
of changepoint detection. But this work is at an 
early stage; there is still a lot to do: 

1. We have evaluated the proposed algorithm in 
the context of the changepoint detection prob- 
lem, and demonstrated its use for estimating 
the location of a changepoint, but we have not 
evaluated its performance for location estima- 
tion and tracking. These problems take advan- 
tage of the full capability of the algorithm, so 
they will test it more rigorously. 

2. In Section 13.71 we identified an intermittent 
problem when we use our algorithm to detect 
changes in variance. We are working on miti- 
gating this problem. 

3. A drawback of the proposed algorithm is that 
it requires time proportional to at each time 
step, where n is the number of steps since the 
second-to-last changepoint. If the time be- 
tween changepoints is more than a few thou- 
sand steps, it may be necessary to desample 
the data to control run time. 

4. So far we have not addressed the autocorrela- 
tion structure that has been seen in many net- 
work measurements, but this work raises an in- 
triguing possibility. If a time series actually 
contains abrupt changepoints, but no correla- 
tion structure between them, it will appear to 
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have autocorrelation if we ignore the change- 
points. In future work we plan to nsc change- 
point detection to divide a time series of mea- 
surements into stationary intervals, and then 
investigate the autocorrelation structure within 
intervals to see whether it is possible that ob- 
served correlations are (at least in part) an ar- 
tifact of stationary models. 
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